Wasserstein Introspective Neural Networks
We present Wasserstein introspective neural networks (WINN) that are both a
generator and a discriminator within a single model. WINN provides a
significant improvement over the recent introspective neural networks (INN)
method by enhancing INN's generative modeling capability. WINN has three
interesting properties: (1) A mathematical connection between the formulation
of the INN algorithm and that of Wasserstein generative adversarial networks
(WGAN) is made. (2) The explicit adoption of the Wasserstein distance into INN
results in a large enhancement to INN, achieving compelling results even with a
single classifier --- e.g., providing nearly a 20 times reduction in model size
over INN for unsupervised generative modeling. (3) When applied to supervised
classification, WINN also gives rise to improved robustness against adversarial
examples in terms of the error reduction. In the experiments, we report
encouraging results on unsupervised learning problems including texture, face,
and object modeling, as well as a supervised classification task against
adversarial attacks.
Comment: Accepted to CVPR 2018 (Oral).
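The Wasserstein objective the abstract refers to can be illustrated with a minimal sketch (not the authors' code): under a 1-Lipschitz critic f, the Wasserstein-1 distance between real and generated distributions is estimated as the mean critic score on real samples minus the mean score on generated ones.

```python
# A minimal sketch, assuming precomputed critic scores; this is the generic
# WGAN critic estimate that WINN adopts, not the paper's implementation.
def wasserstein_critic_estimate(real_scores, fake_scores):
    """Empirical Wasserstein-1 estimate from critic scores on two sample sets."""
    return sum(real_scores) / len(real_scores) - sum(fake_scores) / len(fake_scores)

# Toy scores: a good critic rates real samples above generated ones, so the
# estimate is positive; generator training then pushes this gap toward zero.
print(wasserstein_critic_estimate([1.0, 2.0, 3.0], [0.0, 1.0, 2.0]))  # 1.0
```

In WINN this discriminative signal doubles as the generative signal, which is why a single classifier suffices.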
ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models
In our work, we explore the synergistic capabilities of pre-trained
vision-and-language models (VLMs) and large language models (LLMs) for visual
commonsense reasoning (VCR). We categorize the problem of VCR into visual
commonsense understanding (VCU) and visual commonsense inference (VCI). For
VCU, which involves perceiving the literal visual content, pre-trained VLMs
exhibit strong cross-dataset generalization. On the other hand, in VCI, where
the goal is to infer conclusions beyond image content, VLMs face difficulties.
We find that a baseline where VLMs provide perception results (image captions)
to LLMs leads to improved performance on VCI. However, we identify a challenge
with VLMs' passive perception, which often misses crucial context information,
leading to incorrect or uncertain reasoning by LLMs. To mitigate this issue, we
suggest a collaborative approach where LLMs, when uncertain about their
reasoning, actively direct VLMs to concentrate on and gather relevant visual
elements to support potential commonsense inferences. In our method, named
ViCor, pre-trained LLMs serve as problem classifiers to analyze the problem
category, VLM commanders to leverage VLMs differently based on the problem
classification, and visual commonsense reasoners to answer the question. The
VLMs perform visual recognition and understanding. We evaluate our framework on
two VCR benchmark datasets and outperform all other methods that do not require
in-domain supervised fine-tuning.
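The collaborative loop described above can be sketched as a short control flow; the `llm_*` and `vlm_*` callables below are hypothetical placeholder interfaces, not the paper's API.

```python
# A hypothetical sketch of the ViCor control flow: classify the problem
# (VCU vs. VCI), answer VCU questions with the VLM directly, and for VCI
# let the LLM reason over a caption, actively querying the VLM when unsure.
def vicor_answer(image, question, llm_classify, llm_reason, vlm_caption, vlm_query):
    category = llm_classify(question)          # LLM as problem classifier
    if category == "VCU":
        return vlm_query(image, question)      # literal visual content: VLM answers
    caption = vlm_caption(image)               # passive perception
    answer, confident = llm_reason(question, caption)
    if not confident:                          # LLM as VLM commander: gather evidence
        evidence = vlm_query(image, question)
        answer, _ = llm_reason(question, caption + " " + evidence)
    return answer                              # LLM as visual commonsense reasoner
```

The key design choice is that the VLM is queried a second time only when the LLM's first pass is uncertain, keeping perception targeted rather than exhaustive.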
AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?
Can we better anticipate an actor's future actions (e.g. mix eggs) by knowing
what commonly happens after his/her current action (e.g. crack eggs)? What if
we also know the longer-term goal of the actor (e.g. making egg fried rice)?
The long-term action anticipation (LTA) task aims to predict an actor's future
behavior from video observations in the form of verb and noun sequences, and it
is crucial for human-machine interaction. We propose to formulate the LTA task
from two perspectives: a bottom-up approach that predicts the next actions
autoregressively by modeling temporal dynamics; and a top-down approach that
infers the goal of the actor and plans the needed procedure to accomplish the
goal. We hypothesize that large language models (LLMs), which have been
pretrained on procedure text data (e.g. recipes, how-tos), have the potential
to help LTA from both perspectives: providing prior knowledge of likely next
actions, and inferring the goal given the observed part of a procedure,
respectively. To leverage LLMs, we propose a two-stage
framework, AntGPT. It first recognizes the actions already performed in the
observed videos and then asks an LLM to predict the future actions via
conditioned generation, or to infer the goal and plan the whole procedure by
chain-of-thought prompting. Empirical results on the Ego4D LTA v1 and v2
benchmarks, EPIC-Kitchens-55, as well as EGTEA GAZE+ demonstrate the
effectiveness of our proposed approach. AntGPT achieves state-of-the-art
performance on all of the above benchmarks, and qualitative analysis shows that
it can successfully infer the goal and thus perform goal-conditioned
"counterfactual" prediction. Code and model will be released at
https://brown-palm.github.io/AntGP
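The bottom-up stage of the two-stage framework can be illustrated with a small sketch: stage one yields (verb, noun) labels for the observed actions, and stage two turns them into a conditioned-generation prompt for the LLM. The prompt wording and function name here are illustrative assumptions, not the released code.

```python
# A hypothetical sketch of the bottom-up prompting step: serialize recognized
# (verb, noun) actions and ask an LLM to continue the sequence.
def build_lta_prompt(observed_actions, n_future=3):
    """Format recognized actions into a long-term anticipation prompt."""
    history = ", ".join(f"{verb} {noun}" for verb, noun in observed_actions)
    return (f"Observed actions so far: {history}. "
            f"Predict the next {n_future} actions as 'verb noun' pairs.")

print(build_lta_prompt([("crack", "egg"), ("mix", "egg")]))
```

The top-down variant would instead first ask the LLM to infer the goal (e.g. via chain-of-thought) and then condition the predicted action sequence on that goal.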
Learning Generative Models with Energy-Based Models and Transformer GANs
In this thesis, we study approaches to learn priors on data (i.e. generative modeling) and learners (i.e. meta-learning) for computer vision tasks. We present our approaches to improve the stability and performance of generative modeling and meta-learning methods.
First, we study the use of natural image prior on computer vision tasks. To this end, we introduce a suite of regularization techniques that enhances the performance of energy-based models on realistic image datasets. On generative modeling, we achieve competitive results while using much smaller models. On supervised classification, we observe a significant error reduction against adversarial examples. Our model is the first computer vision model to achieve state-of-the-art image generation and classification within a single model.
Next, we investigate if natural image prior can be learned with less vision-specific inductive biases. To this end, we integrate the Vision Transformer architecture into generative adversarial networks (GAN). We propose novel regularization methods and architectural choices to achieve this goal. The resulting approach, named ViTGAN, achieves comparable performance to the leading CNN-based GAN models on popular image generation benchmarks for the first time.
Lastly, we study a meta-learning approach, which automatically extracts prior knowledge from the set of observed tasks. We present our work on improving the computational complexity of meta-learning. Our approach, named MetaOptNet, offers better few-shot generalization at a modest increase in computational overhead.